class: center, middle, inverse, title-slide

# Tidyverse Intro II
### Antoine & Nicolas
### cynkra GmbH
### February 1, 2022

---

<style type="text/css">
.pull-left {
  margin-top: -25px;
}
.pull-right {
  margin-top: -25px;
}
.remark-code {
  font-size: 12px;
}
.font17 {
  font-size: 17px;
}
.font14 {
  font-size: 14px;
}
</style>

# Introduction

Organization of half-day R courses:

- Intro courses:
    * Tidyverse intro I
    * Base R intro/Tidyverse intro II (this course)
    * Data visualization I
    * Data visualization II

- Advanced courses:
    * Advanced tidyverse (this afternoon)
    * R package creation
    * Working with database systems
    * Parallelization & efficient R programming
    * Advanced topics (tbd)

---

# Course material

Our course material is currently available from GitHub at https://github.com/cynkra/bag-courses

Today we will be looking at the folder `1-2_intro_tidy-ii`.

![](data:image/png;base64,#github-download.png)

---

# General remarks

- Even though we are starting out remotely, we hope for these courses to be interactive: go ahead and ask if something is unclear!

- You can also write into the chat, which I will try to monitor while Antoine is presenting.

- We were asked to provide recordings of the courses for those of you who cannot join, so recording is activated.

- Per course unit, we offer 4 hours of follow-up time; approach us with questions (nicolas@cynkra.com)!

---

# RStudio Intro

![](data:image/png;base64,#rstudio.png)

---

# Assignment in R

Assignment means we *bind* a *value* to a *name* in an *environment*.

.pull-left[
```r
a <- FALSE
b <- "a"
c <- 2.3
d <- c(1, 2, 3)
```
]

.pull-right[
<img src="data:image/png;base64,#bindings.png" width="50%" style="display: block; margin: auto;" />
]

The assignment operator `<-` performs this *binding* in the current environment, here the global environment (`.GlobalEnv`).

.pull-left[
```r
0a <- 1
```

```
## Error: unexpected symbol in "0a"
```
]

.pull-right[
```r
if <- 1
```

```
## Error: unexpected assignment in "if <-"
```
]

There are some rules as to what names are permissible.
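---

# Permissible names: a sketch

The naming rules mentioned on the previous slide can be illustrated as follows. This is a minimal sketch, not a complete statement of the rules (see `?make.names` for those); the names `my_value`, `.hidden` and `value.2` are made up for illustration.

```r
# Syntactic names start with a letter (or a dot not followed by a digit)
# and contain only letters, digits, `.` and `_`
my_value <- 1
.hidden <- 2
value.2 <- 3

# Non-syntactic names can still be bound by quoting them with backticks
`0a` <- 1
`0a` + 1
# [1] 2

# make.names() turns arbitrary strings into syntactic names
make.names(c("0a", "if", "my value"))
# [1] "X0a"      "if."      "my.value"
```

Backtick quoting works, but syntactic names are easier to type and to read.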
---

# Retrieving values

A value can be accessed via its name (not a string!).

.pull-left[
```r
a
```

```
## [1] FALSE
```
]

.pull-right[
```r
b
```

```
## [1] "a"
```
]

(If it is accessible from the current environment.)

```r
some_crazy_name
```

```
## Error: object 'some_crazy_name' not found
```

Bindings in an environment can be listed using `ls()`.

.pull-left[
```r
ls()
```

```
## [1] "a" "b" "c" "d"
```
]

.pull-right[
```r
ls(envir = new.env())
```

```
## character(0)
```
]

---

# What objects are accessible?

.pull-left[
```r
ls()
```

```
## [1] "a" "b" "c" "d"
```
]

.pull-right[
```r
mean
```

```
## function (x, ...) 
## UseMethod("mean")
## <bytecode: 0x7fdfc7b28fb0>
## <environment: namespace:base>
```
]

.pull-left[
```r
search()
```

```
##  [1] ".GlobalEnv"        "package:stats"
##  [3] "package:graphics"  "package:grDevices"
##  [5] "package:utils"     "package:datasets"
##  [7] "package:colorout"  "package:devtools"
##  [9] "package:usethis"   "package:methods"
## [11] "Autoloads"         "package:base"
```
]

.pull-right[
```r
library(readr)
search()
```

```
##  [1] ".GlobalEnv"        "package:readr"
##  [3] "package:stats"     "package:graphics"
##  [5] "package:grDevices" "package:utils"
##  [7] "package:datasets"  "package:colorout"
##  [9] "package:devtools"  "package:usethis"
## [11] "package:methods"   "Autoloads"
## [13] "package:base"
```
]

<img src="data:image/png;base64,#search-path.png" width="75%" style="display: block; margin: auto;" />

---

# Base R vector classes I

The function `c()` can be used to combine objects, such as literals.

```r
x_log <- c(TRUE, FALSE)     # same as c(T, F)
x_int <- c(1L, 2L, 3L)      # use 1L to enforce integer, rather than numeric
x_num <- c(1, 2, 6.3, 0.12) # also called 'double'
x_chr <- c("Hello World")   # or 'Hello World'
```

We can check the type using `class()` and the length with `length()`.
.pull-left[
```r
class(x_log)
```

```
## [1] "logical"
```

```r
class(c(x_int, x_chr))
```

```
## [1] "character"
```
]

.pull-right[
```r
length(x_chr)
```

```
## [1] 1
```

```r
length(c(x_int, x_chr))
```

```
## [1] 4
```
]

There is no type distinction between scalar and vector values!

???

There is a certain order in the list above: `logical` is the least flexible type, while `character` is the most flexible. If you combine vectors of different types, the more flexible class wins.

---

# Base R vector classes II

The class of a vector can safely be changed to a more "general" type.

.pull-left[
```r
as.logical(c(1, 0, 2))
```

```
## [1]  TRUE FALSE  TRUE
```

```r
as.integer(c(TRUE, FALSE, TRUE))
```

```
## [1] 1 0 1
```
]

.pull-right[
```r
as.numeric(c("1", "2"))
```

```
## [1] 1 2
```

```r
as.character(c(TRUE, FALSE))
```

```
## [1] "TRUE"  "FALSE"
```
]

A change to a more "specific" type is also possible, but values that cannot be converted become `NA`.

```r
as.numeric(c("hi", "number", "1"))
```

```
## Warning: NAs introduced by coercion
```

```
## [1] NA NA  1
```

???

These functions will always work if you coerce towards greater flexibility. If you want to go the other way, it may give you `NA`s and some warnings.

---

# Sequences & repetitions

Often we need to create vectors with patterns, such as sequences

.pull-left[
```r
1:5
```

```
## [1] 1 2 3 4 5
```

```r
seq(1, 5)
```

```
## [1] 1 2 3 4 5
```
]

.pull-right[
```r
seq(3, 9, by = 2)
```

```
## [1] 3 5 7 9
```

```r
seq(0, 1, length.out = 5)
```

```
## [1] 0.00 0.25 0.50 0.75 1.00
```
]

or repetitions

.pull-left[
```r
rep(5, 4)
```

```
## [1] 5 5 5 5
```

```r
rep("hello", 2)
```

```
## [1] "hello" "hello"
```
]

.pull-right[
```r
rep(1:4, 2)
```

```
## [1] 1 2 3 4 1 2 3 4
```

```r
rep(1:4, each = 2)
```

```
## [1] 1 1 2 2 3 3 4 4
```
]

---

# Basic arithmetic

Operators `+`, `-`, `*`, `/`, etc.
are implemented as functions

.pull-left[
```r
2 + 3
```

```
## [1] 5
```
]

.pull-right[
```r
`+`(2, 3)
```

```
## [1] 5
```
]

Operations are vectorized (element-wise)

.pull-left[
```r
x <- c(1, 2, 4)
x + c(5, 0, -1)
```

```
## [1] 6 2 3
```

```r
1:5 * 2
```

```
## [1]  2  4  6  8 10
```
]

.pull-right[
```r
x <- c(1, 2, 4)
x * c(5, 0, -1)
```

```
## [1]  5  0 -4
```

```r
1:5 * rep(2, 5)
```

```
## [1]  2  4  6  8 10
```
]

---

# Recycling

In case of length mismatch, the shorter vector is recycled.

.pull-left[
```r
c(1, 2) + c(6, 0, 9, 20, 22, 11)
```

```
## [1]  7  2 10 22 23 13
```
]

.pull-right[
```r
c(1, 2, 1, 2, 1, 2) + c(6, 0, 9, 20, 22, 11)
```

```
## [1]  7  2 10 22 23 13
```
]

```r
c(1, 2, 3, 4) + c(6, 0, 9, 20, 22, 11)
```

```
## Warning in c(1, 2, 3, 4) + c(6, 0, 9, 20, 22, 11): longer object length is not a
## multiple of shorter object length
```

```
## [1]  7  2 12 24 23 13
```

Advice: in general, avoid relying on recycling except for vectors of length 1.

---

# Comparison operators

.pull-left[
```r
x <- c(1, 2, 4, 2)
```
]

.pull-right[
```r
y <- c(2, 2, 4, 5)
```
]

Inequality: `<`, `>`, `<=`, `>=`

.pull-left[
```r
x < 2
```

```
## [1]  TRUE FALSE FALSE FALSE
```

```r
x < y
```

```
## [1]  TRUE FALSE FALSE  TRUE
```
]

.pull-right[
```r
x <= 2
```

```
## [1]  TRUE  TRUE FALSE  TRUE
```

```r
x <= y
```

```
## [1] TRUE TRUE TRUE TRUE
```
]

Equality (and its negation): `==`, `!=`

.pull-left[
```r
x == y
```

```
## [1] FALSE  TRUE  TRUE FALSE
```
]

.pull-right[
```r
x != y
```

```
## [1]  TRUE FALSE FALSE  TRUE
```
]

---

# Logical operators

.pull-left[
```r
(a <- x >= 2)
```

```
## [1] FALSE  TRUE  TRUE  TRUE
```
]

.pull-right[
```r
(b <- x < 4)
```

```
## [1]  TRUE  TRUE FALSE  TRUE
```
]

Boolean operators: `&` (AND), `|` (OR), `!` (NOT)

.pull-left[
```r
a & b
```

```
## [1] FALSE  TRUE FALSE  TRUE
```

```r
a | b
```

```
## [1] TRUE TRUE TRUE TRUE
```
]

.pull-right[
```r
!a & b
```

```
## [1]  TRUE FALSE FALSE FALSE
```

```r
!(a & b)
```

```
## [1]  TRUE FALSE  TRUE FALSE
```
]

We also have `&&` and `||`
(discussed later)

---

# Numeric indexing

.pull-left[
```r
x <- c(1.2, 3.9, 0.4, 0.12)
```
]

.pull-right[
```r
i <- 3:4
```
]

We can extract values from a vector, using a numeric index.

.pull-left[
```r
x[c(1, 3)]
```

```
## [1] 1.2 0.4
```

```r
x[c(1, 1, 3)]
```

```
## [1] 1.2 1.2 0.4
```

```r
x[-1]
```

```
## [1] 3.90 0.40 0.12
```
]

.pull-right[
```r
x[2:3]
```

```
## [1] 3.9 0.4
```

```r
x[i]
```

```
## [1] 0.40 0.12
```

```r
x[-i]
```

```
## [1] 1.2 3.9
```
]

---

# Logical indexing

.pull-left[
```r
x <- c(1.2, 3.9, 0.4, 0.12)
x
```

```
## [1] 1.20 3.90 0.40 0.12
```
]

.pull-right[
```r
(i <- rep(c(TRUE, FALSE), each = 2))
```

```
## [1]  TRUE  TRUE FALSE FALSE
```
]

We can extract values from a vector, using a logical index.

.pull-left[
```r
x[i]
```

```
## [1] 1.2 3.9
```

```r
x[!i]
```

```
## [1] 0.40 0.12
```

```r
x[x > 2]
```

```
## [1] 3.9
```
]

.pull-right[
```r
x[TRUE]
```

```
## [1] 1.20 3.90 0.40 0.12
```

```r
x[FALSE]
```

```
## numeric(0)
```

```r
x[-i]
```

```
## [1] 3.90 0.40 0.12
```
]

???

last example: complete craziness! `-i` evaluates to `c(-1, -1, 0, 0)`, which selects all but the first element and adds nothing to this

---

# Subset assignment

Combines a subsetting operation with an assignment.

```r
y <- c(1.2, 3.9, 0.4, 0.12)
y[c(2, 4)] <- 5
y[c(FALSE, TRUE, FALSE, TRUE)] <- 5
y
```

```
## [1] 1.2 5.0 0.4 5.0
```

```r
y[y > 2] <- 2
y
```

```
## [1] 1.2 2.0 0.4 2.0
```

```r
y[y == 2] <- c(4, 5)
y
```

```
## [1] 1.2 4.0 0.4 5.0
```

---

# Exercises (Homework)

1. Create a vector called `v1` containing the numbers 2, 5, 8, 12 and 16.

1. Extract the values at positions 2 and 5 from `v1`.

1. Use `x:y` notation to make a second vector called `v2` containing the numbers 5 to 9.

1. Subtract `v2` from `v1` and look at the result.

1. Generate a vector with 1000 standard-normally distributed random numbers (use `rnorm()`). Store the result as `v3`. Extract the numbers that are bigger than 2.
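---

# Subset assignment and recycling: a sketch

The last example on the subset assignment slide, `y[y == 2] <- c(4, 5)`, works cleanly because exactly two elements are selected and two values are supplied. The sketch below spells this out and contrasts it with a length-1 replacement; `which()` is a base R function that was not used on the earlier slides and is shown here only to make the selected positions visible.

```r
y <- c(1.2, 2.0, 0.4, 2.0)

# Two elements are selected and two values supplied: assigned in order
y[y == 2] <- c(4, 5)
y
# [1] 1.2 4.0 0.4 5.0

# A single value is recycled to all selected positions
y[y > 1] <- 0
y
# [1] 0.0 0.0 0.4 0.0

# which() converts a logical index into the matching positions
which(y == 0)
# [1] 1 2 4
```

In line with the advice on the recycling slide, supplying either one replacement value or exactly as many as are selected keeps the result easy to predict.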
---

# Matrices

Internally represented by a column-major vector, with dimensions.

.pull-left[
```r
(m <- matrix(c(1, 4, 2, 2, 7, 3), nrow = 2))
```

```
##      [,1] [,2] [,3]
## [1,]    1    2    7
## [2,]    4    2    3
```
]

.pull-right[
```r
dim(m)
```

```
## [1] 2 3
```
]

Two indexes are required for subset selection.

.pull-left[
```r
m[1, 2]
```

```
## [1] 2
```

```r
m[, c(2, 3)]
```

```
##      [,1] [,2]
## [1,]    2    7
## [2,]    2    3
```
]

.pull-right[
```r
m[1, ]
```

```
## [1] 1 2 7
```

```r
m[1, , drop = FALSE]
```

```
##      [,1] [,2] [,3]
## [1,]    1    2    7
```
]

---

# Exercises (Homework)

1. Create a 10 x 10 matrix that contains a sequence of numbers (use the `:` notation).

1. Extract the second column of the matrix.

1. Extract the fifth row of the matrix.

1. Extract the fifth and the sixth row of the matrix.

1. Compare the classes of the results of the previous two subsetting operations.

1. Modify your solution to exercise 3 so that it returns the same class as the result of exercise 4.

---

# Lists

.pull-left[
```r
(x <- list(u = c(2, 3, 4), v = "abc"))
```

```
## $u
## [1] 2 3 4
##
## $v
## [1] "abc"
```
]

.pull-right[
```r
length(x)
```

```
## [1] 2
```
]

Subsetting can be done with `$`, `[` or `[[`

.pull-left[
```r
x$u
```

```
## [1] 2 3 4
```

```r
x[["u"]]
```

```
## [1] 2 3 4
```

```r
x[[1]]
```

```
## [1] 2 3 4
```
]

.pull-right[
```r
x["u"]
```

```
## $u
## [1] 2 3 4
```

```r
x[1:2]
```

```
## $u
## [1] 2 3 4
##
## $v
## [1] "abc"
```
]

---

# List subsetting mnemonic

<img src="data:image/png;base64,#lists.png" width="75%" style="display: block; margin: auto;" />

.pull-left[
```r
(x <- list(list(1:3), list(4:6)))
```

```
## [[1]]
## [[1]][[1]]
## [1] 1 2 3
##
##
## [[2]]
## [[2]][[1]]
## [1] 4 5 6
```
]

.pull-right[
```r
x[[1]][[1]]
```

```
## [1] 1 2 3
```

```r
x[[1]][[1]][1]
```

```
## [1] 1
```
]

---

# Names

.pull-left[
```r
(x <- list(u = c(2, 3, 4), v = "abc"))
```

```
## $u
## [1] 2 3 4
##
## $v
## [1] "abc"
```
]

.pull-right[
```r
names(x)
```

```
## [1] "u" "v"
```
]

Any vector in R can have a names attribute.
.pull-left[
```r
(y <- c(a = 1, b = 2, c = 3))
```

```
## a b c 
## 1 2 3
```

```r
class(y)
```

```
## [1] "numeric"
```
]

.pull-right[
```r
y[c("a", "b")]
```

```
## a b 
## 1 2
```
]

---

# Data frames

Two-dimensional like matrices, but implemented as lists of equal-length columns.

```r
(d <- data.frame(kids = c("Jack", "Jill", "Jamie"), ages = c(12, 10, 7)))
```

```
##    kids ages
## 1  Jack   12
## 2  Jill   10
## 3 Jamie    7
```

.pull-left[
```r
d[["ages"]] # same as d$ages
```

```
## [1] 12 10  7
```

```r
d[1, ]
```

```
##   kids ages
## 1 Jack   12
```

```r
d[, 1]
```

```
## [1] "Jack"  "Jill"  "Jamie"
```
]

.pull-right[
```r
length(d) # same as ncol(d)
```

```
## [1] 2
```

```r
nrow(d)
```

```
## [1] 3
```

```r
dim(d)
```

```
## [1] 3 2
```
]

---

# Exercises (Homework)

1. Generate two random vectors of length 10, `a` and `b`. Combine them in a list and call it `l1`.

1. Compare the classes of `l1[2]` and `l1[[2]]`. Can you explain the difference?

1. How many rows does the data frame `mtcars` contain? The dataset is available by default. Just try typing `mtcars`.

1. What is the type of the column `vs` of `mtcars`?

1. Try printing the column names of `mtcars`.
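---

# `&&` and `||`: a sketch

The scalar operators `&&` and `||` mentioned on the logical operators slide are the ones typically used in `if` conditions. A minimal sketch, assuming a recent R version: as far as we know, from R 4.3.0 on an operand of length greater than one is an error, while older versions used only the first element.

```r
x <- c(1, 2, 4, 2)

# `&` and `|` work element-wise and return a vector
x > 1 & x < 4
# [1] FALSE  TRUE FALSE  TRUE

# `&&` and `||` expect length-1 operands and return a single value,
# which is what `if` needs
length(x) > 0 && x[1] < 2
# [1] TRUE

# `&&` stops as soon as the result is known: the right-hand side
# is not evaluated here, so no error is raised
FALSE && stop("never evaluated")
# [1] FALSE
```

To reduce a whole logical vector to a single value for use in `if`, base R offers `any()` and `all()`.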
---

# Control flow

.pull-left[
```r
for (i in 1:8) {
  print(i)
}
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
```

```r
i <- 1
while (i < 5) {
  print(i)
  i <- i + 1
}
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
```
]

.pull-right[
```r
for (i in 1:8) {
  if (i < 5) print(-i)
  else print(i)
}
```

```
## [1] -1
## [1] -2
## [1] -3
## [1] -4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
```

```r
for (i in 1:8) {
  if (i == 5) break
  print(i)
}
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
```
]

---

# Functions

We can define and invoke (or call) a function as

.pull-left[
```r
half <- function(x) {
  x / 2
}
```
]

.pull-right[
```r
y <- c(3, 2, 1)
half(y)
```

```
## [1] 1.5 1.0 0.5
```
]

The following definitions are equivalent

.pull-left[
```r
half <- function(x) {
  z <- x / 2
  z
}
```
]

.pull-right[
```r
half <- function(x) {
  z <- x / 2
  return(z)
}
```
]

.pull-left[
```r
half <- function(x) x / 2
```
]

.pull-right[
```r
half <- \(x) x / 2
```
]

---

# Default arguments

```r
fraction <- function(x, denominator = 2) {
  x / denominator
}

y <- seq(2, 8, by = 2)
```

.pull-left[
```r
fraction(y)
```

```
## [1] 1 2 3 4
```

```r
fraction(y, 2)
```

```
## [1] 1 2 3 4
```

```r
fraction(y, denominator = 2)
```

```
## [1] 1 2 3 4
```
]

.pull-right[
```r
fraction(y, 4)
```

```
## [1] 0.5 1.0 1.5 2.0
```

```r
fraction(denominator = 4, y)
```

```
## [1] 0.5 1.0 1.5 2.0
```

```r
fraction(denominator = 4, x = y)
```

```
## [1] 0.5 1.0 1.5 2.0
```
]

---

# Variable scope I

.pull-left[
```r
frac_1 <- function(x, denominator = 2) {
  result <- x / denominator
  result
}
```
]

.pull-right[
```r
frac_2 <- function(x) {
  denominator <- denominator + 1
  x / denominator
}
```
]

.pull-left[
```r
result <- "hello"
frac_1(seq(2, 8, by = 2))
```

```
## [1] 1 2 3 4
```

```r
result
```

```
## [1] "hello"
```
]

.pull-right[
```r
denominator <- 1
frac_2(seq(2, 8, by = 2))
```

```
## [1] 1 2 3 4
```

```r
denominator
```

```
## [1] 1
```
]

---

# Variable scope II

```r
x <- 1
fun_a <- function() x
fun_b <- function() {
  x <- 2
  fun_a()
}
fun_b()
```

.pull-left[
lexical scoping: look at where the function is defined

```
## [1] 1
```
]

.pull-right[
dynamic scoping: look at where the function is called

```
## [1] 2
```
]

R is a lexically scoped language, so `fun_b()` returns 1.

---

# Higher-order functions

In R, a function can take a function as an argument or return a function.

.pull-left[
```r
frac_fun_gen <- function(denominator) {
  function(x) x / denominator
}

half <- frac_fun_gen(2)
class(half)
```

```
## [1] "function"
```

```r
x <- 1:4
half(x)
```

```
## [1] 0.5 1.0 1.5 2.0
```
]

.pull-right[
```r
op_fun <- function(x, y, op_fun) {
  op_fun(x, y)
}

x <- 1:4
x / 2
```

```
## [1] 0.5 1.0 1.5 2.0
```

```r
`/`(x, 2)
```

```
## [1] 0.5 1.0 1.5 2.0
```

```r
op_fun(x, 2, `/`)
```

```
## [1] 0.5 1.0 1.5 2.0
```
]

---

# Base R higher-order functions

```r
x <- list(a = 1:20, b = 21:40, c = 41:60)
```

.pull-left[
```r
lapply(x, sum)
```

```
## $a
## [1] 210
##
## $b
## [1] 610
##
## $c
## [1] 1010
```
]

.pull-right[
```r
res <- list()
for (i in names(x)) {
  res[[i]] <- sum(x[[i]])
}
res
```

```
## $a
## [1] 210
##
## $b
## [1] 610
##
## $c
## [1] 1010
```
]

(It would be better to pre-allocate a list of the correct length, for example as `res <- vector("list", length(x))`.)

---

# Exercises (Homework)

1. Write a function `add_constant` that adds a constant to a vector, and set the default value of the constant to 10.

1. Apply `add_constant` to all columns of `mtcars`.

1. Use `lapply` to calculate the mean of each variable in the `mtcars` dataset. Convert the resulting list to a vector.

1. Using `lapply`, generate a list containing 10 random vectors of random length between 1 and 10.

1. Use the help to see what the `colSums()` function does. Using `apply`, try writing your own version, `colSums2()`.
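---

# Pre-allocation: a sketch

The pre-allocation advice from the slide on base R higher-order functions can be sketched as follows. This is one possible way to write the loop, not the only one; copying the names with `names(res) <- names(x)` and looping over `seq_along(x)` are choices made for this illustration.

```r
x <- list(a = 1:20, b = 21:40, c = 41:60)

# Pre-allocate a list with one (empty) slot per element of x
res <- vector("list", length(x))
names(res) <- names(x)

# Fill the slots by position; seq_along(x) is 1, 2, 3 here
for (i in seq_along(x)) {
  res[[i]] <- sum(x[[i]])
}

# Same result as the lapply() call
identical(res, lapply(x, sum))
# [1] TRUE
```

Growing `res` inside the loop works too, but pre-allocating avoids repeatedly resizing the list, and `lapply()` takes care of this bookkeeping for us.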